---
layout: docs
page_title: Recover from an outage
description: |-
  Discover techniques and steps to recover from a single node failure,
  multi-node failure, or a complete loss of quorum.
---

# Recover from an outage

Don't panic. This is a critical first step.

Depending on your [deployment configuration], it may take only a single server
failure for cluster unavailability. Recovery requires an operator to intervene,
but the process is straightforward.

~> This guide is for recovery from a Nomad outage due to a majority of server
nodes in a datacenter being lost. If you are looking to add or remove servers,
consult the [bootstrapping guide].

## Failure of a single server cluster

If you had only a single server and it has failed, try to restore operation by
restarting it. A single server configuration requires the
[`-bootstrap-expect=1`] flag. If the server cannot be recovered, you need to
bring up a new server. Consult the [bootstrapping guide] for more detail.

In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
why a single server deploy is **never** recommended.

## Failure of a server in a multi-server cluster

If you think the failed server is recoverable, the easiest option is to bring it
back online and have it rejoin the cluster with the same IP address. This will
return the cluster to a fully healthy state. Similarly, even if you need to
rebuild a new Nomad server to replace the failed node, you may wish to do that
immediately. Keep in mind that the rebuilt server needs to have the same IP
address as the failed server. Again, once this server is online and has
rejoined, the cluster will return to a fully healthy state.

Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the
same IP isn't an option, you need to remove the failed server. Usually, you can
issue a [`nomad server force-leave`] command to remove the failed server if it
is still a member of the cluster.

<Note>

 Your raft cluster will need to have a quorum of nodes available to
perform any online modifications to the Raft peer information. Membership
changes are written to the Raft log, and all Raft log writes require quorum.
If this is impossible, continue to **Failure of Multiple Servers in a
Multi-Server Cluster**

</Note>

If, for some reason, the Raft configuration continues to show any stale members,
you can use the [`nomad operator raft remove-peer`] command to remove the stale
peer server on the fly with no downtime.

Once you have made the membership changes necessary, you should verify the
current Raft state with the [`nomad operator raft list-peers`] command:

```shell-session
$ nomad operator raft list-peers
Node                   ID               Address          State     Voter
nomad-server01.global  10.10.11.5:4647  10.10.11.5:4647  follower  true
nomad-server02.global  10.10.11.6:4647  10.10.11.6:4647  leader    true
nomad-server03.global  10.10.11.7:4647  10.10.11.7:4647  follower  true
```

## Failure of multiple servers in a multi-server cluster

In the event that multiple servers are lost, causing a loss of quorum and a
complete outage, partial recovery is possible using data on the remaining
servers in the cluster. There may be data loss in this situation because
multiple servers were lost, so information about what's committed could be
incomplete. The recovery process implicitly commits all outstanding Raft log
entries, so it's also possible to commit data that was uncommitted before the
failure.

The [section below][] contains the details of the recovery procedure. You will
include the remaining servers in the `raft/peers.json` recovery file. The
cluster should be able to elect a leader once the remaining servers are all
restarted with an identical `raft/peers.json` configuration.

Any new servers you introduce later can be fresh with totally clean data
directories and joined using Nomad's `server join` command.

In extreme cases, it should be possible to recover with only a single remaining
server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file.

The `raft/peers.json` recovery file is final, and a snapshot is taken after it
is ingested, so you are guaranteed to start with your recovered configuration.
This does implicitly commit all Raft log entries, so should only be used to
recover from an outage, but it should allow recovery from any situation where
there's some cluster data available.

## Manual recovery using peers.json

To begin, stop all remaining servers. You can attempt a graceful leave, but it
will not work in most cases. Do not worry if the leave exits with an error. The
cluster is in an unhealthy state, so this is expected.

The `peers.json` file is not present by default and is only used when performing
recovery. This file will be deleted after Nomad starts and ingests this file.

Nomad automatically creates a `raft/peers.info` file on startup to mark that it
is on the current version of Raft. Do not remove the `raft/peers.info`
file at any time.

Using `raft/peers.json` for recovery can cause uncommitted Raft log entries to
be implicitly committed, so this should only be used after an outage where no
other option is available to recover a lost server. Make sure you don't have any
automated processes that will put the peers file in place on a periodic basis.

The next step is to go to the [`-data-dir`][] of each Nomad server. Inside that
directory, there will be a `raft/` sub-directory. Create a `raft/peers.json`
file. Its contents will depend on the raft protocol version of your cluster.

### Raft protocol 3 peers.json specification

```json
[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:4647",
    "non_voter": false
  },
  {
    "id": "8b6dda82-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.2:4647",
    "non_voter": false
  },
  {
    "id": "97e17742-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.3:4647",
    "non_voter": false
  }
]
```

- `id` `(string: **required**)` - Specifies the `node ID` of the server. This
  can be found in the logs when the server starts up, and it can also be found
  inside the `node-id` file in the server's data directory.

- `address` `(string: **required**)` - Specifies the IP and port of the server
  in `ip:port` format. The port is the server's RPC port used for cluster
  communications, typically 4647.

- `non_voter` `(bool: _false_)` - This controls whether the server is a
  non-voter, which is used in some advanced [Autopilot] configurations. If
  omitted, it will default to false, which is typical for most clusters.

You can use this `jq` filter to create a `peers.json` file with the list of `alive` servers. Check the generated output and make any necessary changes.

```shell
$ nomad server members -json | jq '[ .[] | select(.Status == "alive") | {id: .Tags.id, address: "\(.Tags.rpc_addr):\(.Tags.port)", non_voter: false} ]'
```

### Raft protocol 2 peers.json specification

```json
["10.0.1.8:4647", "10.0.1.6:4647", "10.0.1.7:4647"]
```

Raft protocol version 2 peers.json files contain a list of IP:Port addresses for
each server. Note that the port should refer to the RPC port and not the HTTP
API port.

### Deploy peers.json to all server nodes

Create entries for all remaining servers. You must confirm that servers you do
not include here have indeed failed and will not later rejoin the cluster.

Deploy this file is the same across all remaining server nodes.

### Verify keyring on server nodes

<Warning>

Prior to Nomad 1.9.0, [key material][Key material] was never stored in Raft. This meant that
the `nomad agent snapshot save` command and snapshot agent did not save Nomad's
keyring. If you are using versions prior to Nomad 1.9.0, you should make sure you have backed up the keyring of at least one
server.

</Warning>

Go to the [`-data-dir`][] of each Nomad server. Inside that directory, there
will be a `keystore/` sub-directory with `.nks.json` files. Ensure that these
files exist on at least one server before continuing.

### Restart cluster nodes

At this point, you can restart all the remaining servers. Log lines will be
emitted as the servers ingest the recovery file:

```plaintext
...
2016/08/16 14:39:20 [INFO] nomad: found peers.json file, recovering Raft configuration...
2016/08/16 14:39:20 [INFO] nomad.fsm: snapshot created in 12.484µs
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
2016/08/16 14:39:20 [INFO] nomad: deleted peers.json file after successful recovery
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:4647 Address:10.212.15.121:4647}]
...
```

If any servers managed to perform a graceful leave, you may need to have them
rejoin the cluster using the [`server join`] command:

```shell-session
$ nomad server join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It should be noted that any existing member can be used to rejoin the cluster as
the gossip protocol will take care of discovering the server nodes.

At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:

```plaintext
[INFO] nomad: cluster leadership acquired
```

You can use the [`nomad operator raft list-peers`] command to inspect the Raft
configuration:

```shell-session
$ nomad operator raft list-peers
Node                   ID               Address          State     Voter
nomad-server01.global  10.10.11.5:4647  10.10.11.5:4647  follower  true
nomad-server02.global  10.10.11.6:4647  10.10.11.6:4647  leader    true
nomad-server03.global  10.10.11.7:4647  10.10.11.7:4647  follower  true
```

[`-bootstrap-expect=1`]: /nomad/docs/configuration/server#bootstrap_expect
[`-data-dir`]: /nomad/docs/configuration#data_dir
[`nomad operator raft list-peers`]: /nomad/commands/operator/raft/list-peers
[`nomad operator raft remove-peer`]: /nomad/commands/operator/raft/remove-peer
[`nomad server force-leave`]: /nomad/commands/server/force-leave
[`nomad server force-leave`]: /nomad/commands/server/force-leave
[`server join`]: /nomad/commands/server/join
[autopilot]: /nomad/docs/manage/autopilot
[bootstrapping guide]: /nomad/docs/deploy/clusters/connect-nodes
[deployment configuration]: /nomad/docs/architecture/cluster/consensus#deployment_table
[section below]: #manual-recovery-using-peers-json
[Key material]: /nomad/docs/manage/key-management
[restore the keyring]: /nomad/docs/manage/key-management#restoring-the-keyring-from-backup
